Abstract

Collaborative recommendation is an information-filtering technique 
that attempts to present information items that are likely of interest 
to an Internet user. Traditionally, collaborative systems deal with 
situations with two types of variables, users and items. In its most 
common form, the problem is framed as trying to estimate ratings 
for items that have not yet been consumed by a user. Despite wide- 
ranging literature, little is known about the statistical properties of 
recommendation systems. In fact, no clear probabilistic model even 
exists which would allow us to precisely describe the mathematical 
forces driving collaborative filtering. To provide an initial contribution 
to this, we propose to set out a general sequential stochastic model for 
collaborative recommendation. We offer an in-depth analysis of the 
so-called cosine-type nearest neighbor collaborative method, which is 
one of the most widely used algorithms in collaborative filtering, and
analyze its asymptotic performance as the number of users grows. We
establish consistency of the procedure under mild assumptions on the 
model. Rates of convergence and examples are also provided. 

Index Terms — Collaborative recommendation - cosine-type similar- 
ity - nearest neighbor estimate - consistency - rate of convergence. 

AMS 2000 Classification: 62G05, 62G20. 

1 Introduction 

Collaborative recommendation is a Web information-filtering technique that 
typically gathers information about your personal interests and compares 
your profile to other users with similar tastes. The goal of this system is to 
give personalized recommendations, whether this be movies you might enjoy, 
books you should read or the next restaurant you should go to. 

There has been much work done in this area over the past decade, since the
appearance of the first papers on the subject in the mid-90's (Resnick et al.
[13], Hill et al. [11], Shardanand and Maes [16]). Stimulated by an abundance
of practical applications, most of the research activity to date has focused
on elaborating various heuristics and practical methods (Breese et al. [1],
Heckerman et al. [10], Salakhutdinov et al. [13]) so as to provide personalized
recommendations and help Web users deal with information overload.
Examples of such applications include recommending books, people, restaurants,
movies, CDs and news. Websites such as amazon.com, match.com,
movielens.org and allmusic.com already have recommendation systems in
operation. We refer the reader to the surveys by Adomavicius and Tuzhilin
[3] and Adomavicius et al. for a broader picture of the field, an overview
of results and many related references.

Traditionally, collaborative systems deal with situations with two types of 
variables, users and items. In its most common form, the problem is framed 
as trying to estimate ratings for items that have not yet been consumed by 
a user. The recommendation process typically starts by asking users a series 
of questions about items they liked or did not like. For example, in a movie 
recommendation system, users initially rate some subset of films they have 
already seen. Personal ratings are then collected in a matrix, where each row 
represents a user, each column an item, and entries in the matrix represent 
a given user's rating of a given item. An example is presented in Table 1,
where ratings are specified on a scale from 1 to 10, and "NA" means that
the user has not rated the corresponding film.

[Table 1: body not recoverable from the source; surviving entries: NA, 3, 3, 4, ?]

Table 1: A (subset of a) ratings matrix for a movie recommendation system. 
Ratings are specified on a scale from 1 to 10, and "NA" means that the user 
has not rated the corresponding film. 



Based on this prior information, the recommendation engine must be able to
automatically furnish ratings of as-yet unrated items and then suggest appropriate
recommendations based on these predictions. To do this, a number of
practical methods have been proposed, including machine learning-oriented
techniques (e.g., Abernethy et al.), statistical approaches (e.g., Sarwar et
al. [15]) and numerous other ad hoc rules (Adomavicius and Tuzhilin [2]).
The collaborative filtering issue may be viewed as a special instance of the 
problem of inferring the many missing entries of a data matrix. This field, 
which has very recently emerged, is known as the matrix completion problem, 
and comes up in many areas of science and engineering, including collabora- 
tive filtering, machine learning, control, remote sensing and computer vision. 
We will not pursue this promising approach, and refer the reader to Candes 
and Recht [6] and Candes and Plan [5] who survey the literature on matrix 
completion. These authors show in particular that under suitable conditions, 
one can recover an unknown low rank matrix from a nearly minimal set of 
entries by solving a simple convex optimization problem. 

In most of the approaches, the crux is to identify users whose tastes/ratings 
are "similar" to the user we would like to advise. The similarity measure 
assessing proximity between users may vary depending on the type of ap- 
plication, but is typically based on a correlation or cosine-type approach 
(Sarwar et al. [15]). 

Despite wide-ranging literature, very little is known about the statistical
properties of recommendation systems. In fact, no clear probabilistic model
even exists allowing us to precisely describe the mathematical forces driving
collaborative filtering. To provide an initial contribution to this, we propose
in the present paper to set out a general stochastic model for collaborative
recommendation and analyze its asymptotic performance as the number of
users grows.

The document is organized as follows. In section 2, we provide a sequential
stochastic model for collaborative recommendation and describe the statistical
problem. In the model we analyze, unrated items are estimated by
averaging ratings of users who are "similar" to the user we would like to advise.
The similarity is assessed by a cosine-type measure, and unrated items
are estimated using a $k_n$-nearest neighbor-type regression estimate, which
is indeed one of the most widely used procedures in collaborative filtering.
It turns out that the choice of the cosine proximity as a similarity measure 
imposes constraints on the model, which are discussed in section 3. Under 
mild assumptions, consistency of the estimation procedure is established in 
section 4, whereas rates of convergence are discussed in section 5. Illustrative 
examples are given throughout the document, and proofs of some technical 
results are postponed to section 6. 

2 A model for collaborative recommendation 
2.1 Ratings matrix and new users 

Suppose that there are $d + 1$ ($d \geq 1$) possible items, $n$ users in the ratings
matrix (i.e., the database) and that users' ratings take values in the set
$(\{0\} \cup [1, s])^{d+1}$. Here, $s$ is a real number greater than 1 corresponding to
the maximal rating and, by convention, the symbol 0 means that the user
has not rated the item (same as "NA"). Thus, the ratings matrix has $n$ rows,
$d + 1$ columns and entries from $\{0\} \cup [1, s]$. For example, $n = 8$, $d = 5$ and
$s = 10$ in Table 1, which will be our toy example throughout this section.
Then, a new user Bob reveals some of his preferences for the first time, rating
some of the first $d$ items but not the $(d + 1)$th (the movie Titanic in Table
1). We want to design a strategy to predict Bob's rating of Titanic using:
(i) Bob's ratings of some (or all) of the other $d$ movies and (ii) the ratings
matrix. This is illustrated in Table 1, where Bob has rated 4 out of the 5
movies.
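The coding convention just described (0 for an unrated item, a value in $[1, s]$ otherwise) can be sketched in a few lines of Python. This is only an illustration: the helper name `encode_ratings` and the use of `None` to stand for "NA" are assumptions made for the example, while Bob's ratings are those quoted in the text.

```python
def encode_ratings(raw, s=10.0):
    """Map raw ratings to the ({0} U [1, s]) coding; None stands for "NA"."""
    coded = []
    for r in raw:
        if r is None:
            coded.append(0.0)          # 0 means "not rated"
        elif 1.0 <= r <= s:
            coded.append(float(r))
        else:
            raise ValueError(f"rating {r} outside [1, {s}]")
    return coded

# Bob's ratings of the first d = 5 movies (the first one is unrated).
bob = encode_ratings([None, 3, 3, 4, 5])
```

Keeping 0 outside the admissible rating range $[1, s]$ is what makes it usable as a missing-value marker.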

The first step in our approach is to model the preferences of new user Bob
by a random vector $(X, Y)$ of size $d + 1$ taking values in the set $[1, s]^d \times [1, s]$.
Within this framework, the random variable $X = (X_1, \dots, X_d)$ represents
Bob's preferences pertaining to the first $d$ movies, whereas $Y$, the (unobserved)
variable of interest, refers to the movie Titanic. In fact, as Bob does
not necessarily reveal all his preferences at once, we do not observe the variable
$X$, but instead some "masked" version of it denoted hereafter by $X^\star$.
The random variable $X^\star = (X_1^\star, \dots, X_d^\star)$ is naturally defined by

\[ X_j^\star = \begin{cases} X_j & \text{if } j \in M \\ 0 & \text{otherwise,} \end{cases} \]

where $M$ stands for some non-empty random subset of $\{1, \dots, d\}$ indexing
the movies which have been rated by Bob. Observe that the random variable
$X^\star$ takes values in $(\{0\} \cup [1, s])^d$ and that $\|X^\star\| \geq 1$, where $\|\cdot\|$ denotes the
usual Euclidean norm on $\mathbb{R}^d$. In the example of Table 1, $M = \{2, 3, 4, 5\}$ and
(the realization of) $X^\star$ is $(0, 3, 3, 4, 5)$.
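The masking step can be illustrated as follows. This is a sketch under stated assumptions: the function name `mask` is hypothetical, and the first entry of `x_full` (7.0) is invented for the example, since Bob's rating of the first movie is never observed in the text.

```python
import math

def mask(x, rated):
    """Keep x_j when j (1-indexed) belongs to `rated`; put 0 elsewhere."""
    return [xj if (j + 1) in rated else 0.0 for j, xj in enumerate(x)]

# Hypothetical full preference vector X; only the entries indexed by M
# are ever revealed.
x_full = [7.0, 3.0, 3.0, 4.0, 5.0]
M = {2, 3, 4, 5}
x_star = mask(x_full, M)

# Since M is non-empty and ratings are at least 1, the norm is at least 1.
norm_x_star = math.sqrt(sum(v * v for v in x_star))
```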

We follow the same approach to model preferences of users already in the
database (Jim, James, Steve, Mary, etc. in Table 1), who will therefore
be represented by a sequence of independent $[1, s]^d \times [1, s]$-valued random
pairs $(X_1, Y_1), \dots, (X_n, Y_n)$ from the distribution of $(X, Y)$. A first idea for
dealing with potential non-responses of a user $i$ in the ratings matrix ($i =
1, \dots, n$) is to consider in place of $X_i = (X_{i1}, \dots, X_{id})$ its masked version
$X_i^\star = (X_{i1}^\star, \dots, X_{id}^\star)$ defined by

\[ X_{ij}^\star = \begin{cases} X_{ij} & \text{if } j \in M_i \cap M \\ 0 & \text{otherwise,} \end{cases} \qquad (2.1) \]

where each $M_i$ is the random subset of $\{1, \dots, d\}$ indexing the movies which
have been rated by user $i$. In other words, we only keep in $X_i^\star$ items corated
by both user $i$ and the new user: items which have not been rated by $X$
and $X_i$ are declared non-informative and simply thrown away.

However, this model, which is static in nature, does not allow us to take into
account the fact that, as time goes by, each user in the database may reveal
more and more preferences. This will for instance typically be the case in
the movie recommendation system of Table 1, where regular customers will
update their ratings each time they have seen a new movie. Consequently,
model (2.1) is not fully satisfying and must therefore be slightly modified to
better capture the sequential evolution of ratings.

2.2 A sequential model 

A possible dynamical approach for collaborative recommendation is based
on the following protocol: users enter the database one after the other and
update their list of ratings sequentially in time. More precisely, we suppose
that at each time $i = 1, 2, \dots$, a new user enters the process and reveals
his preferences for the first time, while the $i - 1$ previous users are allowed
to rate new items. Thus, at time 1, there is only one user in the database
(Jim in Table 1), and the (non-empty) subset of items he decides to rate is
modeled by a random variable $M_1^1$ taking values in $\mathcal{P}^\star(\{1, \dots, d\})$, the set
of non-empty subsets of $\{1, \dots, d\}$. At time 2, a new user (James) enters
the game and reveals his preferences according to a $\mathcal{P}^\star(\{1, \dots, d\})$-valued
random variable $M_2^1$, with the same distribution as $M_1^1$. At the same time,
Jim (user 1) may update his list of preferences, modeled by a random variable
$M_1^2$ satisfying $M_1^1 \subset M_1^2$. The latter requirement just means that the user is
allowed to rate new items but not to remove his past ratings. At time 3, a
new user (Steve) rates items according to a random variable $M_3^1$ distributed
as $M_1^1$, while user 2 updates his preferences according to $M_2^2$ (distributed as
$M_1^2$) and user 1 updates his own according to $M_1^3$, and so on. This sequential
mechanism is summarized in Table 2.





          Time 1   Time 2   ...   Time i        ...   Time n
User 1    M_1^1    M_1^2    ...   M_1^i         ...   M_1^n
User 2             M_2^1    ...   M_2^{i-1}     ...   M_2^{n-1}
...
User i                            M_i^1         ...   M_i^{n+1-i}
...
User n                                                M_n^1

Table 2: A sequential model for preference updating.



By repeating this procedure, we end up at time $n$ with an upper triangular
array $(M_i^j)_{1 \leq i \leq n,\, 1 \leq j \leq n+1-i}$ of random variables. A row in this array consists
of a collection $M_i^j$ of random variables for a given value of $i$, taking values
in $\mathcal{P}^\star(\{1, \dots, d\})$ and satisfying the constraint $M_i^j \subset M_i^{j+1}$. For a fixed $i$,
the sequence $M_i^1 \subset M_i^2 \subset \dots$ describes the (random) way user $i$ sequentially
reveals his preferences over time. Observe that the latter inclusions are not
necessarily strict, so that a single user is not forced to rate one more item at
every single step.
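To make the triangular array concrete, the following sketch simulates one possible instance of it. The growth rule is an assumption chosen for illustration and is not taken from the paper: the first set of each user is uniform over non-empty subsets of $\{1, \dots, d\}$, and at each later step every unrated item is added independently with probability `q`. The function name `simulate_rating_sets` is likewise hypothetical.

```python
import random

def simulate_rating_sets(n, d, q=0.3, seed=0):
    """Simulate M[(i, j)] for 1 <= i <= n and 1 <= j <= n + 1 - i.

    Assumed dynamics (illustration only): uniform non-empty first set,
    then each unrated item is added independently with probability q.
    """
    rng = random.Random(seed)
    items = list(range(1, d + 1))
    M = {}
    for i in range(1, n + 1):
        current = set()
        while not current:                       # enforce non-emptiness
            current = {k for k in items if rng.random() < 0.5}
        M[(i, 1)] = frozenset(current)
        for j in range(2, n + 2 - i):            # updates at times i+1, ..., n
            added = {k for k in items if k not in current and rng.random() < q}
            current = current | added
            M[(i, j)] = frozenset(current)
    return M

M = simulate_rating_sets(n=6, d=5)
```

By construction the sets of a given user are nested, and inclusions need not be strict, as in the model.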

Throughout the paper, we will assume that, for each $i$, the distribution of the
sequence of random variables $(M_i^n)_{n \geq 1}$ is independent of $i$ and is therefore
distributed as a generic random sequence denoted $(M^n)_{n \geq 1}$, satisfying $M^1 \neq \emptyset$
and $M^n \subset M^{n+1}$ for all $n \geq 1$. For the sake of coherence, we assume that
$M^1$ and $M$ (see (2.1)) have the same distribution, i.e., the new abstract user
$X^\star$ may be regarded as a user entering the database for the first time. We
will also suppose that there exists a positive random integer $n_0$ such that
$M^{n_0} = \{1, \dots, d\}$ and, consequently, $M^n = \{1, \dots, d\}$ for all $n \geq n_0$. This
requirement means that each user rates all $d$ items after a (random) period of
time. Last, we will assume that the pairs $(X_i, Y_i)$, $i = 1, \dots, n$, the sequences
$(M_1^n)_{n \geq 1}$, $(M_2^n)_{n \geq 1}, \dots$ and the random variable $M$ are mutually independent.
We note that this implies that the users' ratings are independent.

With this sequential point of view, improving on (2.1), we let the masked
version $X_i^{(n)} = (X_{i1}^{(n)}, \dots, X_{id}^{(n)})$ of $X_i$ be defined as

\[ X_{ij}^{(n)} = \begin{cases} X_{ij} & \text{if } j \in M_i^{n+1-i} \cap M \\ 0 & \text{otherwise.} \end{cases} \]

Again, it is worth pointing out that, in the definition of $X_i^{(n)}$, items which
have not been corated by both $X$ and $X_i$ are deleted. This implies in particular
that $X_i^{(n)}$ may be equal to $\mathbf{0}$, the $d$-dimensional null vector (whereas
$\|X^\star\| \geq 1$ by construction).

Finally, in order to deal with possible non-answers of database users regarding
the variable of interest (Titanic in our movie example), we introduce $(\mathcal{R}_n)_{n \geq 1}$,
a sequence of random variables taking values in $\mathcal{P}^\star(\{1, \dots, n\})$, such that
$\mathcal{R}_n$ is independent of $M$ and the sequences $(M_i^n)_{n \geq 1}$, and satisfying $\mathcal{R}_n \subset
\mathcal{R}_{n+1}$ for all $n \geq 1$. In this formalism, $\mathcal{R}_n$ represents the subset, which is
assumed to be non-empty, of users who have already provided information
about Titanic at time $n$. For example, in Table 1, only James, Mary, John,
Lucy and Johanna have rated Titanic and therefore (the realization of) $\mathcal{R}_n$
is $\{2, 4, 5, 6, 8\}$.

2.3 The statistical problem 

To summarize the model so far, we have at hand at time $n$ a sample of
random pairs $(X_1^{(n)}, Y_1), \dots, (X_n^{(n)}, Y_n)$ and our mission is to predict the score
$Y$ of a new user represented by $X^\star$. The variables $X_1^{(n)}, \dots, X_n^{(n)}$ model the
database users' revealed preferences with respect to the first $d$ items. They
take values in $(\{0\} \cup [1, s])^d$, where a 0 at coordinate $j$ of $X_i^{(n)}$ means that
the $j$th product has not been corated by both user $i$ and the new user. The
variable $X^\star$ takes values in $(\{0\} \cup [1, s])^d$ and satisfies $\|X^\star\| \geq 1$. The random
variables $Y_1, \dots, Y_n$ model users' ratings of the product of interest. They take
values in $[1, s]$ and, at time $n$, we only see a non-empty (random) subset of
$\{Y_1, \dots, Y_n\}$, indexed by $\mathcal{R}_n$.

The statistical problem with which we are faced is to estimate the regression
function $\eta(x^\star) = \mathbb{E}[Y \,|\, X^\star = x^\star]$. For this goal, we may use the database
observations $(X_1^{(n)}, Y_1), \dots, (X_n^{(n)}, Y_n)$ in order to construct an estimate $\eta_n(x^\star)$
of $\eta(x^\star)$. The approach we explore in this paper is a cosine-based $k_n$-nearest
neighbor regression method, one of the most widely used algorithms in collaborative
filtering (e.g., Sarwar et al. [15]).

Given $x^\star \in (\{0\} \cup [1, s])^d - \mathbf{0}$ and the sample $(X_1^{(n)}, Y_1), \dots, (X_n^{(n)}, Y_n)$, the idea
of the cosine-type $k_n$-nearest neighbor (NN) regression method is to estimate
$\eta(x^\star)$ by a local averaging over those $Y_i$ for which: (i) $X_i^{(n)}$ is "close" to
$x^\star$ and (ii) $i \in \mathcal{R}_n$, that is, we effectively "see" the rating $Y_i$. For this,
we scan through the $k_n$ neighbors of $x^\star$ among the database users $X_i^{(n)}$ for
which $i \in \mathcal{R}_n$ and estimate $\eta(x^\star)$ by averaging the $k_n$ corresponding $Y_i$. The
closeness between users is assessed by a cosine-type similarity, defined for
$x = (x_1, \dots, x_d)$ and $x' = (x'_1, \dots, x'_d)$ in $(\{0\} \cup [1, s])^d$ by

\[ S(x, x') = \frac{\sum_{j \in \mathcal{J}} x_j x'_j}{\sqrt{\sum_{j \in \mathcal{J}} x_j^2}\, \sqrt{\sum_{j \in \mathcal{J}} x_j'^2}}, \]

where $\mathcal{J} = \{j \in \{1, \dots, d\} : x_j \neq 0 \text{ and } x'_j \neq 0\}$ and, by convention,
$S(x, x') = 0$ if $\mathcal{J} = \emptyset$. To understand the rationale behind this proximity
measure, just note that if $\mathcal{J} = \{1, \dots, d\}$ then $S(x, x')$ coincides with
$\cos(x, x')$, i.e., two users are "close" with respect to $S$ if their ratings are
more or less proportional. However, the similarity $S$, which will be used to
measure the closeness between $X^\star$ (the new user) and $X_i^{(n)}$ (a database user),
ignores possible non-answers in $X^\star$ or $X_i^{(n)}$, and is therefore better suited to
the recommendation setting. For example, in Table 1,

\[ S(\text{Bob}, \text{Jim}) = S((0, 3, 3, 4, 5), (0, 6, 7, 8, 9)) = S((3, 3, 4, 5), (6, 7, 8, 9)) \approx 0.99, \]
whereas
\[ S(\text{Bob}, \text{Lucy}) = S((0, 3, 3, 4, 5), (3, 10, 2, 7, 0)) = S((3, 3, 4), (10, 2, 7)) \approx 0.89. \]
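The two similarity values above can be reproduced with a direct implementation of $S$. The sketch below is illustrative only: the function name `cosine_type_similarity` is an assumption, and the three rating vectors are the ones quoted in the text.

```python
import math

def cosine_type_similarity(x, y):
    """S(x, x'): cosine computed over coordinates non-zero in both vectors."""
    common = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    if not common:
        return 0.0                      # convention: S = 0 when nothing is corated
    dot = sum(a * b for a, b in common)
    norm_x = math.sqrt(sum(a * a for a, _ in common))
    norm_y = math.sqrt(sum(b * b for _, b in common))
    return dot / (norm_x * norm_y)

bob = [0, 3, 3, 4, 5]
jim = [0, 6, 7, 8, 9]
lucy = [3, 10, 2, 7, 0]
s_bob_jim = cosine_type_similarity(bob, jim)     # about 0.996
s_bob_lucy = cosine_type_similarity(bob, lucy)   # about 0.887
```

Both numbers are in line with the approximate values quoted above. Note that the coordinates dropped from the computation differ from one pair of users to the next.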

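Next, fix $x^\star \in (\{0\} \cup [1, s])^d - \mathbf{0}$ and suppose to simplify that $M \subset M_i^{n+1-i}$ for
each $i \in \mathcal{R}_n$. In this case, it is easy to see that $X_i^{(n)} = X_i^\star = (X_{i1}^\star, \dots, X_{id}^\star)$,
where

\[ X_{ij}^\star = \begin{cases} X_{ij} & \text{if } j \in M \\ 0 & \text{otherwise.} \end{cases} \]

Besides, $Y_i \geq 1$,

\[ S(x^\star, X_i^\star) = \cos(x^\star, X_i^\star) > 0, \qquad (2.2) \]

and an elementary calculation shows that the positive real number $y$ which
maximizes the similarity between $(x^\star, y)$ and $(X_i^\star, Y_i)$, that is

\[ S\big((x^\star, y), (X_i^\star, Y_i)\big), \]

is given by

\[ y = \frac{\|x^\star\|}{\|X_i^\star\| \cos(x^\star, X_i^\star)}\, Y_i. \]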
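The closed form $y = \|x^\star\| Y_i / (\|X_i^\star\| \cos(x^\star, X_i^\star))$ can be checked numerically by maximizing the augmented cosine over a grid. In this sketch the rating $Y = 8$ is a hypothetical value chosen for illustration, and the two vectors are Bob's and Jim's ratings restricted to the corated items.

```python
import math

def augmented_cosine(x, y, X, Y):
    """Cosine between the augmented vectors (x, y) and (X, Y)."""
    dot = sum(a * b for a, b in zip(x, X)) + y * Y
    return dot / (math.sqrt(sum(a * a for a in x) + y * y) *
                  math.sqrt(sum(b * b for b in X) + Y * Y))

x = [3.0, 3.0, 4.0, 5.0]      # new user's ratings on the corated items
X = [6.0, 7.0, 8.0, 9.0]      # database user's ratings on the same items
Y = 8.0                       # hypothetical rating of the item of interest

norm_x = math.sqrt(sum(a * a for a in x))
norm_X = math.sqrt(sum(b * b for b in X))
cos_xX = sum(a * b for a, b in zip(x, X)) / (norm_x * norm_X)
y_closed = norm_x / (norm_X * cos_xX) * Y          # closed form from the text

# Brute-force maximization over a fine grid of positive y values.
grid = [k / 1000.0 for k in range(1, 20001)]
y_grid = max(grid, key=lambda y: augmented_cosine(x, y, X, Y))
```

The grid maximizer agrees with the closed form up to the grid resolution.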
This suggests the following regression estimate $\eta_n(x^\star)$ of $\eta(x^\star)$:

\[ \eta_n(x^\star) = \|x^\star\| \sum_{i \in \mathcal{R}_n} W_{ni}(x^\star) \frac{Y_i}{\|X_i^\star\|}, \qquad (2.3) \]

where the integer $k_n$ satisfies $1 \leq k_n \leq n$ and

\[ W_{ni}(x^\star) = \begin{cases} 1/k_n & \text{if } X_i^\star \text{ is among the } k_n\text{-MS of } x^\star \text{ in } \{X_i^\star, i \in \mathcal{R}_n\} \\ 0 & \text{otherwise.} \end{cases} \]

In the above definition, the acronym "MS" (for Most Similar) means that we
are searching for the $k_n$ "closest" points of $x^\star$ within the set $\{X_i^\star, i \in \mathcal{R}_n\}$
using the similarity $S$ or, equivalently here, using the cosine proximity
(by identity (2.2)). Note that the cosine term has been removed since it has
asymptotically no influence on the estimate, as can be seen by a slight adaptation
of the arguments of the proof of Lemma 6.1, Chapter 6, in Györfi et al.
[9]. The estimate $\eta_n(x^\star)$ is called the cosine-type $k_n$-NN regression estimate
in the collaborative filtering literature. Now, recalling that definition (2.3)
makes sense only when $M \subset M_i^{n+1-i}$ for each $i \in \mathcal{R}_n$ (that is, $X_i^{(n)} = X_i^\star$),
the next step is to extend the definition of $\eta_n(x^\star)$ to the general case. In
view of (2.3), the most natural approach is to simply put

\[ \eta_n(x^\star) = \|x^\star\| \sum_{i \in \mathcal{R}_n} W_{ni}(x^\star) \frac{Y_i}{\|X_i^{(n)}\|}, \qquad (2.4) \]

where

\[ W_{ni}(x^\star) = \begin{cases} 1/k_n & \text{if } X_i^{(n)} \text{ is among the } k_n\text{-MS of } x^\star \text{ in } \{X_i^{(n)}, i \in \mathcal{R}_n\} \\ 0 & \text{otherwise.} \end{cases} \]

The acronym "MS" in the weight $W_{ni}(x^\star)$ means that the $k_n$ closest database
points of $x^\star$ are computed according to the similarity

\[ \bar{S}(x^\star, X_i^{(n)}) = p_i^{(n)} S(x^\star, X_i^{(n)}), \quad \text{with} \quad p_i^{(n)} = \frac{|M_i^{n+1-i} \cap M|}{|M|} \]

(here and throughout, notation $|A|$ means the cardinality of the finite set $A$).
The factor $p_i^{(n)}$ in front of $S$ is a penalty term which, roughly, avoids
over-promoting the last users entering the database. Indeed, the effective number
of items rated by these users will be eventually low and, consequently, their
$S$-proximity to $x^\star$ will tend to remain high. On the other hand, for fixed $i$
and $n$ large enough, we know that $M \subset M_i^{n+1-i}$ and $X_i^{(n)} = X_i^\star$. This implies
$p_i^{(n)} = 1$, $\bar{S}(x^\star, X_i^{(n)}) = S(x^\star, X_i^\star) = \cos(x^\star, X_i^\star)$ and shows that definition
(2.4) generalizes definition (2.3). Therefore, we take the liberty to still call
the estimate (2.4) the cosine-type $k_n$-NN regression estimate.

Remark 2.1 A smoothed version of the similarity $\bar{S}$ could also be considered,
typically

\[ \bar{S}(x^\star, X_i^{(n)}) = \psi(p_i^{(n)})\, S(x^\star, X_i^{(n)}), \]

where $\psi : [0, 1] \to [0, 1]$ is a nondecreasing map satisfying $\psi(1/2) < 1$
(assuming $|M| \geq 2$). For example, the choice $\psi(p) = \sqrt{p}$ tends to promote
users with a low number of rated items, provided the items corated by the new
user are quite similar. In the present paper, we shall only consider the case
$\psi(p) = p$, but the whole analysis carries over without difficulties for general
functions $\psi$.

Remark 2.2 Another popular approach to measure the closeness between 
users is the Pearson correlation coefficient. The extension of our results to 
Pearson-type similarities is not straightforward and more work is needed to 
address this challenging question. We refer the reader to Choi et al. and
Montaner et al. for a comparative study and comments on the choice of
the similarity. 

Finally, for definiteness of the estimate $\eta_n(x^\star)$, some final remarks are in
order:

(i) If $X_i^{(n)}$ and $X_j^{(n)}$ are equidistant from $x^\star$, i.e., $\bar{S}(x^\star, X_i^{(n)}) = \bar{S}(x^\star, X_j^{(n)})$,
then we have a tie and, for example, $X_i^{(n)}$ may be declared "closer" to
$x^\star$ if $i < j$, that is, tie-breaking is done by indices.

(ii) If $|\mathcal{R}_n| < k_n$, then the weights $W_{ni}(x^\star)$ are not defined. In this case,
we conveniently set $W_{ni}(x^\star) = 0$, i.e., $\eta_n(x^\star) = 0$.

(iii) If $X_i^{(n)} = \mathbf{0}$, then we take $W_{ni}(x^\star) = 0$ and we adopt the convention
$0 \times \infty = 0$ for the computation of $\eta_n(x^\star)$.

(iv) With the above conventions, the identity $\sum_{i \in \mathcal{R}_n} W_{ni}(x^\star) \leq 1$ holds in
each case.
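Putting the pieces together, estimate (2.4) with the penalized similarity and the conventions (i) to (iii) can be sketched as below. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the data layout (a list of `(ratings, y)` pairs, with `y = None` for users outside $\mathcal{R}_n$) and the three database users in the example are all hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def similarity(x, y):
    """Cosine-type similarity S over coordinates non-zero in both vectors."""
    common = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    if not common:
        return 0.0
    dot = sum(a * b for a, b in common)
    return dot / (math.sqrt(sum(a * a for a, _ in common)) *
                  math.sqrt(sum(b * b for _, b in common)))

def knn_estimate(x_star, users, k):
    """Sketch of estimate (2.4).

    x_star : new user's masked ratings (0 = unrated).
    users  : list of (ratings, y) in order of entry; ratings use 0 for
             unrated items and y is None when the user is not in R_n.
    """
    M = {j for j, v in enumerate(x_star) if v != 0}
    candidates = []
    for i, (ratings, y) in enumerate(users):
        if y is None:
            continue                                    # i is not in R_n
        x_i = [ratings[j] if j in M else 0.0 for j in range(len(x_star))]
        p_i = sum(1 for j in M if ratings[j] != 0) / len(M)   # penalty term
        candidates.append((-p_i * similarity(x_star, x_i), i, x_i, y))
    if len(candidates) < k:
        return 0.0                                      # convention (ii)
    candidates.sort(key=lambda c: (c[0], c[1]))         # convention (i): ties by index
    total = 0.0
    for _, _, x_i, y in candidates[:k]:
        if norm(x_i) > 0:                               # convention (iii)
            total += y / norm(x_i)
    return norm(x_star) * total / k

bob = [0, 3, 3, 4, 5]
users = [
    ([0, 6, 6, 8, 10], 8.0),   # proportional to Bob on the corated items
    ([3, 10, 2, 7, 0], 4.0),   # shares three items with Bob
    ([5, 0, 0, 0, 0], 9.0),    # shares no item with Bob
    ([0, 3, 3, 4, 5], None),   # has not rated the item of interest
]
eta_1 = knn_estimate(bob, users, k=1)
eta_2 = knn_estimate(bob, users, k=2)
```

With `k = 1` the only neighbor is the first user, whose ratings are exactly twice Bob's, so the prediction is half of that user's rating.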



3 The regression function 

Our objective in section 4 will be to establish consistency of the estimate
$\eta_n(x^\star)$ defined in (2.4) towards the regression function $\eta(x^\star)$. To reach this
goal, we first need to analyze the properties of $\eta(x^\star)$. Surprisingly, the special
form of $\eta_n(x^\star)$ constrains the shape of $\eta(x^\star)$. This is stated in Theorem 3.1
below.

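Theorem 3.1 Suppose that $\eta_n(X^\star) \to \eta(X^\star)$ in probability as $n \to \infty$. Then

\[ \eta(X^\star) = \|X^\star\|\, \mathbb{E}\left[ \frac{Y}{\|X^\star\|} \,\Big|\, \frac{X^\star}{\|X^\star\|} \right] \quad \text{a.s.} \]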



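Proof of Theorem 3.1 Recall that

\[ \eta_n(X^\star) = \|X^\star\| \sum_{i \in \mathcal{R}_n} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} \]

and let

\[ \varphi_n(x^\star) = \sum_{i \in \mathcal{R}_n} W_{ni}(x^\star) \frac{Y_i}{\|X_i^{(n)}\|}. \]

Since $(\eta_n(X^\star))_n$ is a Cauchy sequence in probability and $\|X^\star\| \geq 1$, $(\varphi_n(X^\star))_n$
is also a Cauchy sequence. Thus, there exists a measurable function $\varphi$ on $\mathbb{R}^d$
such that $\varphi_n(X^\star) \to \varphi(X^\star)$ in probability. Using the fact that $0 \leq \varphi_n(X^\star) \leq
s$ for all $n \geq 1$, we conclude that $0 \leq \varphi(X^\star) \leq s$ a.s. as well.

Let us extract a sequence $(n_k)_k$ satisfying $\varphi_{n_k}(X^\star) \to \varphi(X^\star)$ a.s. Observing
that, for $x^\star \neq \mathbf{0}$,

\[ \varphi_{n_k}(x^\star) = \varphi_{n_k}\left( \frac{x^\star}{\|x^\star\|} \right), \]

we may write $\varphi(X^\star) = \varphi(X^\star/\|X^\star\|)$ a.s. Consequently, the limit in probability
of $(\eta_n(X^\star))_n$ is

\[ \|X^\star\|\, \varphi\left( \frac{X^\star}{\|X^\star\|} \right). \]

Therefore, by the uniqueness of the limit, $\eta(X^\star) = \|X^\star\| \varphi(X^\star/\|X^\star\|)$ a.s.
Moreover,

\begin{align*}
\varphi\left( \frac{X^\star}{\|X^\star\|} \right)
 &= \mathbb{E}\left[ \varphi\left( \frac{X^\star}{\|X^\star\|} \right) \Big|\, \frac{X^\star}{\|X^\star\|} \right] \\
 &= \mathbb{E}\left[ \frac{\eta(X^\star)}{\|X^\star\|} \,\Big|\, \frac{X^\star}{\|X^\star\|} \right] \\
 &= \mathbb{E}\left[ \mathbb{E}\left[ \frac{Y}{\|X^\star\|} \,\Big|\, X^\star \right] \Big|\, \frac{X^\star}{\|X^\star\|} \right] \\
 &= \mathbb{E}\left[ \frac{Y}{\|X^\star\|} \,\Big|\, \frac{X^\star}{\|X^\star\|} \right],
\end{align*}

since $\sigma(X^\star/\|X^\star\|) \subset \sigma(X^\star)$. This concludes the proof of the theorem.

□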



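An important consequence of Theorem 3.1 is that if we intend to prove any
consistency result regarding the estimate $\eta_n(x^\star)$, then we have to assume
that the regression function $\eta(x^\star)$ has the special form

\[ \eta(x^\star) = \|x^\star\| \varphi(x^\star), \quad \text{where} \quad \varphi(x^\star) = \mathbb{E}\left[ \frac{Y}{\|X^\star\|} \,\Big|\, \frac{X^\star}{\|X^\star\|} = \frac{x^\star}{\|x^\star\|} \right]. \]

This will be our fundamental requirement throughout the paper, and it will
be denoted by (F). In particular, if $\tilde{x}^\star = \lambda x^\star$ with $\lambda > 0$, then $\eta(\tilde{x}^\star) =
\lambda \eta(x^\star)$. That is, if two ratings $x^\star$ and $\tilde{x}^\star$ are proportional, then so must be
the values of the regression function at $x^\star$ and $\tilde{x}^\star$, respectively.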
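The homogeneity property follows from (F) in one step. The display below is a worked restatement, using only that $\lambda > 0$, so that $\lambda x^\star / \|\lambda x^\star\| = x^\star / \|x^\star\|$ and $\|\lambda x^\star\| = \lambda \|x^\star\|$:

```latex
\begin{align*}
\eta(\lambda x^\star)
  &= \|\lambda x^\star\| \,
     \mathbb{E}\!\left[ \frac{Y}{\|X^\star\|} \,\middle|\,
       \frac{X^\star}{\|X^\star\|} = \frac{\lambda x^\star}{\|\lambda x^\star\|} \right] \\
  &= \lambda \|x^\star\| \,
     \mathbb{E}\!\left[ \frac{Y}{\|X^\star\|} \,\middle|\,
       \frac{X^\star}{\|X^\star\|} = \frac{x^\star}{\|x^\star\|} \right]
   = \lambda\, \eta(x^\star).
\end{align*}
```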



4 Consistency 

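In this section, we establish the $L_1$ consistency of the regression estimate
$\eta_n(x^\star)$ towards the regression function $\eta(x^\star)$. Using $L_1$ consistency is essentially
a matter of taste, and all the subsequent results may be easily adapted
to $L_p$ norms without too much effort. In the proofs, we will make repeated
use of the two following facts. Recall that, for a fixed $i \in \mathcal{R}_n$, the random
variable $X_i^\star = (X_{i1}^\star, \dots, X_{id}^\star)$ is defined by

\[ X_{ij}^\star = \begin{cases} X_{ij} & \text{if } j \in M \\ 0 & \text{otherwise,} \end{cases} \]

and $X_i^{(n)} = X_i^\star$ as soon as $M \subset M_i^{n+1-i}$. Recall also that, by definition,
$\|X^\star\| \geq 1$.

Fact 4.1 For each $i \in \mathcal{R}_n$,

\[ \bar{S}(X^\star, X_i^\star) = S(X^\star, X_i^\star) = \cos(X^\star, X_i^\star) = 1 - \frac{1}{2}\, d^2\left( \frac{X^\star}{\|X^\star\|}, \frac{X_i^\star}{\|X_i^\star\|} \right), \]

where $d$ is the usual Euclidean distance on $\mathbb{R}^d$.

Fact 4.2 Let, for all $i \geq 1$,

\[ T_i = \min\{k \geq i : M_i^{k+1-i} \supset M\} \]

be the first time instant when user $i$ has rated all the films indexed by $M$. Set

\[ \mathcal{L}_n = \{i \in \mathcal{R}_n : T_i \leq n\}, \qquad (4.1) \]

and define, for $i \in \mathcal{L}_n$,

\[ W_{ni}^\star(x^\star) = \begin{cases} 1/k_n & \text{if } X_i^\star \text{ is among the } k_n\text{-MS of } x^\star \text{ in } \{X_i^\star, i \in \mathcal{L}_n\} \\ 0 & \text{otherwise.} \end{cases} \]

Then

\[ W_{ni}^\star(x^\star) = \begin{cases} 1/k_n & \text{if } \dfrac{X_i^\star}{\|X_i^\star\|} \text{ is among the } k_n\text{-NN of } \dfrac{x^\star}{\|x^\star\|} \text{ in } \left\{ \dfrac{X_i^\star}{\|X_i^\star\|}, i \in \mathcal{L}_n \right\} \\ 0 & \text{otherwise,} \end{cases} \]

where the $k_n$-NN are evaluated with respect to the Euclidean distance on $\mathbb{R}^d$.
That is, the $W_{ni}^\star(x^\star)$ are the usual Euclidean NN weights (Györfi et al. [9]),
indexed by the random set $\mathcal{L}_n$.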
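Both facts rest on the elementary identity $\cos(x, y) = 1 - \frac{1}{2} d^2(x/\|x\|, y/\|y\|)$, which implies that ranking by decreasing cosine is the same as ranking by increasing Euclidean distance between normalized vectors. The sketch below checks this numerically. The three database vectors are hypothetical and are assumed to be already masked to the same items as `x_star`.

```python
import math

def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def cosine(x, y):
    return sum(a * b for a, b in zip(unit(x), unit(y)))

def sq_dist_on_sphere(x, y):
    return sum((a - b) ** 2 for a, b in zip(unit(x), unit(y)))

x_star = [0.0, 3.0, 3.0, 4.0, 5.0]
# Hypothetical database vectors, masked to the same items as x_star.
database = [
    [0.0, 6.0, 7.0, 8.0, 9.0],
    [0.0, 10.0, 2.0, 7.0, 1.0],
    [0.0, 1.0, 9.0, 2.0, 2.0],
]

# Identity of Fact 4.1: cos(x, y) = 1 - d^2(x/|x|, y/|y|) / 2.
gaps = [abs(cosine(x_star, v) - (1 - sq_dist_on_sphere(x_star, v) / 2))
        for v in database]

# Most-similar ordering versus nearest-neighbor ordering on the unit sphere.
by_cosine = sorted(range(len(database)),
                   key=lambda i: -cosine(x_star, database[i]))
by_distance = sorted(range(len(database)),
                     key=lambda i: sq_dist_on_sphere(x_star, database[i]))
```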

Recall that $|\mathcal{R}_n|$ represents the number of users who have already provided
information about the variable of interest (the movie Titanic in our example)
at time $n$. We are now in a position to state the main result of this section.

Theorem 4.1 Suppose that $|M| \geq 2$ and that assumption (F) is satisfied.
Suppose that $k_n \to \infty$, $|\mathcal{R}_n| \to \infty$ a.s. and $\mathbb{E}[k_n/|\mathcal{R}_n|] \to 0$ as $n \to \infty$. Then

\[ \mathbb{E}\,|\eta_n(X^\star) - \eta(X^\star)| \to 0 \quad \text{as } n \to \infty. \]

Thus, to achieve consistency, the number of nearest neighbors $k_n$, over which
one averages in order to estimate the regression function, should on the one
hand tend to infinity but should, on the other hand, be small with respect
to the cardinality of the subset of database users who have already rated the
item of interest. We illustrate this result by working out two examples.

Example 4.1 Consider, to start with, the somewhat ideal situation where
all users in the database have rated the item of interest. In this case, $\mathcal{R}_n =
\{1, \dots, n\}$, and the asymptotic conditions on $k_n$ become $k_n \to \infty$ and $k_n/n \to 0$
as $n \to \infty$. These are just the well-known conditions ensuring consistency
of the usual (i.e., Euclidean) NN regression estimate (Györfi et al. [9], Chapter 6).

Example 4.2 In this more sophisticated model, we recursively define the
sequence $(\mathcal{R}_n)_n$ as follows. Fix, for simplicity, $\mathcal{R}_1 = \{1\}$. At step $n \geq 2$,
we first decide whether or not to add one element to $\mathcal{R}_{n-1}$, with probability $p \in (0, 1)$,
independently of the data. If we decide to increase $\mathcal{R}_n$, then we do it by
picking a random variable $B_n$ uniformly over the set $\{1, \dots, n\} - \mathcal{R}_{n-1}$, and
set $\mathcal{R}_n = \mathcal{R}_{n-1} \cup \{B_n\}$; otherwise, $\mathcal{R}_n = \mathcal{R}_{n-1}$. Clearly, $|\mathcal{R}_n| - 1$ is a sum of
$n - 1$ independent Bernoulli random variables with parameter $p$, and it has
therefore a binomial distribution with parameters $n - 1$ and $p$. Consequently,

\[ \mathbb{E}\left[ \frac{k_n}{|\mathcal{R}_n|} \right] = \frac{k_n \left[ 1 - (1 - p)^n \right]}{np}. \]

In this setting, consistency holds provided $k_n \to \infty$ and $k_n = o(n)$ as $n \to \infty$.
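The closed form for $\mathbb{E}[k_n/|\mathcal{R}_n|]$ relies on the identity $\mathbb{E}[1/(1 + B)] = (1 - (1-p)^n)/(np)$ for $B$ binomial with parameters $n - 1$ and $p$. The sketch below verifies it by direct summation. The function names and the numerical values of `n`, `p` and `k_n` are arbitrary choices made for illustration.

```python
from math import comb

def expected_inverse_size(n, p):
    """E[1 / |R_n|] when |R_n| - 1 ~ Binomial(n - 1, p), by direct summation."""
    return sum(comb(n - 1, j) * p**j * (1 - p)**(n - 1 - j) / (1 + j)
               for j in range(n))

def closed_form(n, p):
    """Closed form (1 - (1 - p)^n) / (n p) for the same expectation."""
    return (1 - (1 - p)**n) / (n * p)

n, p, k_n = 50, 0.3, 5
lhs = k_n * expected_inverse_size(n, p)
rhs = k_n * closed_form(n, p)
```

Since the numerator is bounded by $k_n$, the expectation is of order $k_n/n$, which is why the condition reduces to $k_n = o(n)$.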



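In the sequel, the letter $C$ will denote a positive constant, the value of which
may vary from line to line. Proof of Theorem 4.1 will strongly rely on facts
4.1 and 4.2 and the following proposition.

Proposition 4.1 Suppose that $|M| \geq 2$ and that assumption (F) is satisfied.
Let $\alpha_{ni} = \mathbb{P}(M^{n+1-i} \not\supset M \,|\, M)$. Then

\[ \mathbb{E}\,|\eta_n(X^\star) - \eta(X^\star)| \leq C \left\{ \mathbb{E}\left[ \frac{k_n}{|\mathcal{R}_n|} \sum_{i \in \mathcal{R}_n} \alpha_{ni} \right] + \mathbb{E}\left[ \prod_{i \in \mathcal{R}_n} \alpha_{ni} \right] + \mathbb{E}\,\Big| \sum_{i \in \mathcal{L}_n} W_{ni}^\star(X^\star) \frac{Y_i}{\|X_i^\star\|} - \varphi(X^\star) \Big| \right\}, \]

where $\mathcal{R}_n$ stands for the non-empty subset of users who have already provided
information about the variable of interest at time $n$ and $\mathcal{L}_n$ is defined in (4.1).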

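Proof of Proposition 4.1 Since $\|X^\star\| \leq s\sqrt{d}$, it will be enough to
upper bound the quantity

\[ \mathbb{E}\,\Big| \sum_{i \in \mathcal{R}_n} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} - \varphi(X^\star) \Big|. \]

To this aim, we write

\[ \mathbb{E}\,\Big| \sum_{i \in \mathcal{R}_n} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} - \varphi(X^\star) \Big| \leq \mathbb{E}\,\Big| \sum_{i \in \mathcal{L}_n^c} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} \Big| + \mathbb{E}\,\Big| \sum_{i \in \mathcal{L}_n} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} - \varphi(X^\star) \Big|, \]

where the symbol $A^c$ denotes the complement of the set $A$. Let the event

\[ \mathcal{A} = \left[ \exists\, i \in \mathcal{L}_n^c : X_i^{(n)} \text{ is among the } k_n\text{-MS of } X^\star \text{ in } \{X_i^{(n)}, i \in \mathcal{R}_n\} \right]. \]

Since $\sum_{i \in \mathcal{L}_n^c} W_{ni}(X^\star) \leq 1$, we have

\[ \mathbb{E}\,\Big| \sum_{i \in \mathcal{L}_n^c} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} \Big| = \mathbb{E}\left[ \Big| \sum_{i \in \mathcal{L}_n^c} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} \Big|\, \mathbf{1}_{\mathcal{A}} \right] \leq s\, \mathbb{P}(\mathcal{A}). \]

Observing that, for $i \in \mathcal{L}_n$, $X_i^{(n)} = X_i^\star$ and $W_{ni}(X^\star) \mathbf{1}_{\mathcal{A}^c} = W_{ni}^\star(X^\star) \mathbf{1}_{\mathcal{A}^c}$ (fact
4.2), we obtain

\begin{align*}
\mathbb{E}\,\Big| \sum_{i \in \mathcal{L}_n} W_{ni}(X^\star) \frac{Y_i}{\|X_i^{(n)}\|} - \varphi(X^\star) \Big|
 &= \mathbb{E}\left[ \Big| \sum_{i \in \mathcal{L}_n} W_{ni}(X^\star) \frac{Y_i}{\|X_i^\star\|} - \varphi(X^\star) \Big|\, \mathbf{1}_{\mathcal{A}} \right]
  + \mathbb{E}\left[ \Big| \sum_{i \in \mathcal{L}_n} W_{ni}^\star(X^\star) \frac{Y_i}{\|X_i^\star\|} - \varphi(X^\star) \Big|\, \mathbf{1}_{\mathcal{A}^c} \right] \\
 &\leq s\, \mathbb{P}(\mathcal{A}) + \mathbb{E}\,\Big| \sum_{i \in \mathcal{L}_n} W_{ni}^\star(X^\star) \frac{Y_i}{\|X_i^\star\|} - \varphi(X^\star) \Big|.
\end{align*}

Applying finally Lemma 6.5 completes the proof of the proposition.

□

We are now in a position to prove Theorem 4.1.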



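Proof of Theorem 4.1 According to Proposition 4.1, Lemma 6.1 and
Lemma 6.2, the result will be proven if we show that

\[ \mathbb{E}\,\Big| \sum_{i \in \mathcal{L}_n} W_{ni}^\star(X^\star) \frac{Y_i}{\|X_i^\star\|} - \varphi(X^\star) \Big| \to 0 \quad \text{as } n \to \infty. \]

For $L_n \in \mathcal{P}(\{1, \dots, n\})$, set

\[ Z_n^{L_n} = \frac{1}{k_n} \sum_{i \in L_n} \mathbf{1}_{\left[ \frac{X_i^\star}{\|X_i^\star\|} \text{ is among the } k_n\text{-NN of } \frac{X^\star}{\|X^\star\|} \text{ in } \left\{ \frac{X_i^\star}{\|X_i^\star\|},\, i \in L_n \right\} \right]} \frac{Y_i}{\|X_i^\star\|} - \varphi(X^\star). \]

Conditionally on the event $[M = m]$, the random variables $X^\star$ and $\{X_i^\star, i \in
L_n\}$ are independent and identically distributed. Thus, applying Theorem
6.1 in [9], we obtain

\[ \forall \varepsilon > 0,\ \exists A_m \geq 1 : \ k_n \geq A_m \text{ and } \frac{|L_n|}{k_n} \geq A_m \implies \mathbb{E}_m |Z_n^{L_n}| \leq \varepsilon, \]

where we use the notation $\mathbb{E}_m[\cdot] = \mathbb{E}[\cdot \,|\, M = m]$. Let $\mathbb{P}_m(\cdot) = \mathbb{P}(\cdot \,|\, M = m)$.
By independence,

\[ \mathbb{E}_m |Z_n^{\mathcal{L}_n}| = \sum_{L_n \in \mathcal{P}(\{1, \dots, n\})} \mathbb{E}_m |Z_n^{L_n}|\; \mathbb{P}_m(\mathcal{L}_n = L_n). \]

Consequently, letting $A = \max A_m$, where the maximum is taken over all
possible choices of $m \in \mathcal{P}^\star(\{1, \dots, d\})$, we get, for all $n$ such that $k_n \geq A$,

\begin{align*}
\mathbb{E}_m |Z_n^{\mathcal{L}_n}|
 &= \sum_{\substack{L_n \in \mathcal{P}(\{1, \dots, n\}) \\ |L_n| \geq A k_n}} \mathbb{E}_m |Z_n^{L_n}|\; \mathbb{P}_m(\mathcal{L}_n = L_n)
  + \sum_{\substack{L_n \in \mathcal{P}(\{1, \dots, n\}) \\ |L_n| < A k_n}} \mathbb{E}_m |Z_n^{L_n}|\; \mathbb{P}_m(\mathcal{L}_n = L_n) \\
 &\leq \varepsilon + s\, \mathbb{P}_m(|\mathcal{L}_n| < A k_n).
\end{align*}

Therefore

\[ \mathbb{E}\,|Z_n^{\mathcal{L}_n}| = \mathbb{E}\big[ \mathbb{E}[\, |Z_n^{\mathcal{L}_n}| \,|\, M ] \big] \leq \varepsilon + s\, \mathbb{P}(|\mathcal{L}_n| < A k_n). \]

Moreover, by Lemma 6.2,

\[ \frac{|\mathcal{L}_n|}{k_n} = \frac{|\mathcal{R}_n|}{k_n} \cdot \frac{|\mathcal{L}_n|}{|\mathcal{R}_n|} \to \infty \quad \text{in probability as } n \to \infty. \]

Thus, for all $\varepsilon > 0$, $\limsup_{n \to \infty} \mathbb{E}\,|Z_n^{\mathcal{L}_n}| \leq \varepsilon$, whence $\mathbb{E}\,|Z_n^{\mathcal{L}_n}| \to 0$ as $n \to \infty$.
This shows the desired result.

□

5 Rates of convergence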



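In this section, we bound the rate of convergence of $\mathbb{E}\,|\eta_n(X^\star) - \eta(X^\star)|$ for
the cosine-type $k_n$-NN regression estimate. To reach this objective, we will
require that the function

\[ \varphi(x^\star) = \mathbb{E}\left[ \frac{Y}{\|X^\star\|} \,\Big|\, \frac{X^\star}{\|X^\star\|} = \frac{x^\star}{\|x^\star\|} \right] \]

satisfies a Lipschitz-type property with respect to the similarity $S$. More
precisely, we say that $\varphi$ is Lipschitz with respect to $S$ if there exists a constant
$C > 0$ such that, for all $x$ and $x'$ in $\mathbb{R}^d$,

\[ |\varphi(x) - \varphi(x')| \leq C \sqrt{1 - S(x, x')}. \]

In particular, for $x$ and $x' \in \mathbb{R}^d - \mathbf{0}$ with the same null components, this
property can be rewritten as

\[ |\varphi(x) - \varphi(x')| \leq \frac{C}{\sqrt{2}}\, d\left( \frac{x}{\|x\|}, \frac{x'}{\|x'\|} \right), \]

where we recall that $d$ denotes Euclidean distance.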

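Theorem 5.1 Suppose that assumption (F) is satisfied and that $\varphi$ is Lipschitz
with respect to $S$. Let $\alpha_{ni} = \mathbb{P}(M^{n+1-i} \not\supset M \,|\, M)$, and assume that
$|M| \geq 4$. Then there exists $C > 0$ such that, for all $n \geq 1$,

\[ \mathbb{E}\,|\eta_n(X^\star) - \eta(X^\star)| \leq C \left\{ \mathbb{E}\left[ \frac{k_n}{|\mathcal{R}_n|} \sum_{i \in \mathcal{R}_n} \alpha_{ni} \right] + \mathbb{E}\left[ \prod_{i \in \mathcal{R}_n} \alpha_{ni} \right] + \mathbb{E}\left[ \left( \frac{k_n}{|\mathcal{R}_n|} \right)^{P_n} \right] + \frac{1}{\sqrt{k_n}} \right\}, \]

where $P_n = 1/(|M| - 1)$ if $k_n \leq |\mathcal{R}_n|$, and $P_n = 1$ otherwise.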



To get an intuition on the meaning of Theorem 5.1, it helps to note that
the terms depending on $\alpha_{ni}$ do measure the influence of the unrated items
on the performance of the estimate. Clearly, this performance improves as
the $\alpha_{ni}$ decrease, i.e., as the proportion of rated items grows. On the other
hand, the term $\mathbb{E}[(k_n/|\mathcal{R}_n|)^{P_n}]$ can be interpreted as a bias term in dimension
$|M| - 1$, whereas $1/\sqrt{k_n}$ represents a variance term. As usual in nonparametric
estimation, the rate of convergence of the estimate is dramatically
deteriorated as $|M|$ becomes large. However, in practice, this drawback may
be circumvented by using preliminary dimension reduction steps, such as
factorial methods (PCA, etc.) or inverse regression methods (SIR, etc.).

Example 5.1 (cont. Example 4.1) Recall that we assume, in this ideal
model, that $\mathcal{R}_n = \{1, \dots, n\}$. Suppose in addition that $M = \{1, \dots, d\}$, i.e.,
any new user in the database rates all products the first time he enters the
database. Then the upper bound of Theorem 5.1 becomes

\[ \mathbb{E}\,|\eta_n(X^\star) - \eta(X^\star)| = O\left( \left( \frac{k_n}{n} \right)^{1/(d-1)} + \frac{1}{\sqrt{k_n}} \right). \]

Since neither $\mathcal{R}_n$ nor $M$ are random in this model, we see that there is no
influence of the dynamical rating process. Besides, we recognize the usual
rate of convergence of the Euclidean NN regression estimate (Györfi et al.
[9], Chapter 6) in dimension $d - 1$. In particular, the choice $k_n \sim n^{2/(d+1)}$
leads to

\[ \mathbb{E}\,|\eta_n(X^\star) - \eta(X^\star)| = O\left( n^{-1/(d+1)} \right). \]

Note that we are led to a $(d - 1)$-dimensional rate of convergence (instead of
the usual $d$) just because everything happens as if the data is projected on the
unit sphere of $\mathbb{R}^d$.

Example 5.2 (cont. Example 4.2) In addition to the model of Example 4.2, we
suppose that at each time, a user entering the game reveals his preferences
according to the following sequential procedure. At time 1, the user rates
exactly 4 items by randomly guessing in $\{1, \dots, d\}$. At time 2, he
updates his preferences by adding exactly one rating among his unrated items,
randomly chosen in $\{1, \dots, d\} - M_i^1$. Similarly, at time 3, the user
revises his preferences according to a new item uniformly selected in
$\{1, \dots, d\} - M_i^2$, and so on. In such a scenario,
$|M_i^j| = \min(d, j+3)$ and thus, $M_i^j = \{1, \dots, d\}$ for
$j \ge d - 3$. Moreover, since $|M| = 4$, a moment's thought shows that

\[
\alpha_{ni} =
\begin{cases}
0 & \text{if } i \le n - d + 4, \\[4pt]
1 - \dfrac{\binom{d-4}{n-i}}{\binom{d}{n+4-i}} & \text{if } n - d + 5 \le i \le n.
\end{cases}
\]

Assuming $n \ge d - 5$, we obtain

\[
\sum_{i \in \mathcal{R}_n} \alpha_{ni}
\le \sum_{i=n-d+5}^{n} \alpha_{ni}
= \sum_{i=n-d+5}^{n} \left(1 - \frac{(n+4-i)(n+3-i)(n+2-i)(n+1-i)}{d(d-1)(d-2)(d-3)}\right)
\le (d-4)\left(1 - \frac{24}{d(d-1)(d-2)(d-3)}\right).
\]

Similarly, letting $\mathcal{R}_{n0} = \mathcal{R}_n \cap \{n-d+5, \dots, n\}$,
we have

\[
\prod_{i \in \mathcal{R}_n} \alpha_{ni}
= \prod_{i \in \mathcal{R}_{n0}} \alpha_{ni} \, \mathbf{1}_{\{\min(\mathcal{R}_n) \ge n-d+5\}}
\le \left(1 - \frac{24}{d(d-1)(d-2)(d-3)}\right)^{|\mathcal{R}_{n0}|}
\mathbf{1}_{\{\min(\mathcal{R}_n) \ge n-d+5\}}.
\]

Since $|\mathcal{R}_n| - 1$ has binomial distribution with parameters $n - 1$
and $p$, we obtain

\[
\mathbb{E}\left[\prod_{i \in \mathcal{R}_n} \alpha_{ni}\right]
\le \mathbb{P}\left(\min(\mathcal{R}_n) \ge n-d+5\right)
\le \mathbb{P}\left(|\mathcal{R}_n| \le d-4\right)
\le \frac{C}{n}.
\]

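Finally, applying Jensen's inequality,

\[
\mathbb{E}\left[\left(\frac{k_n}{|\mathcal{R}_n|}\right)^{P_n}\right]
\le C \left\{
\left(\mathbb{E}\left[\frac{k_n}{|\mathcal{R}_n|}
\mathbf{1}_{\{k_n \le |\mathcal{R}_n|\}}\right]\right)^{1/3}
+ \mathbb{E}\left[\frac{k_n}{|\mathcal{R}_n|}
\mathbf{1}_{\{k_n > |\mathcal{R}_n|\}}\right]
\right\}.
\]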



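Putting all the pieces together, we get with Theorem 5.1

\[
\mathbb{E}\left|\eta_n(\mathbf{X}^\star) - \eta(\mathbf{X}^\star)\right|
= O\left(\left(\frac{k_n}{n}\right)^{1/3} + \frac{1}{\sqrt{k_n}}\right).
\]

In particular, the choice $k_n \sim n^{2/5}$ leads to

\[
\mathbb{E}\left|\eta_n(\mathbf{X}^\star) - \eta(\mathbf{X}^\star)\right|
= O\left(n^{-1/5}\right),
\]

which is the usual NN regression estimate rate of convergence when the data
is projected on the unit sphere of $\mathbb{R}^4$.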
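
The closed form of $\alpha_{ni}$ in Example 5.2 can be verified with exact
rational arithmetic. The sketch below assumes, as in the example, that after
$j = n + 1 - i$ time steps user $i$ has rated a uniformly distributed set of
$\min(d, j+3)$ items and that $|M| = 4$; the function names are hypothetical.
It compares the ratio of binomial coefficients with the product form used in
the sum, and checks the bound $(d-4)\left(1 - 24/(d(d-1)(d-2)(d-3))\right)$.

```python
from fractions import Fraction
from math import comb


def alpha_binomial(n: int, i: int, d: int) -> Fraction:
    """alpha_{ni} as one minus a ratio of binomial coefficients (|M| = 4)."""
    j = n + 1 - i  # number of time steps spent by user i in the database
    if j >= d - 3:  # user i has rated every item, so M is covered
        return Fraction(0)
    # P(a uniform (j+3)-subset of {1..d} contains a fixed 4-subset)
    return 1 - Fraction(comb(d - 4, j - 1), comb(d, j + 3))


def alpha_product(n: int, i: int, d: int) -> Fraction:
    """alpha_{ni} written with the product (n+4-i)(n+3-i)(n+2-i)(n+1-i)."""
    j = n + 1 - i
    if j >= d - 3:
        return Fraction(0)
    num = (j + 3) * (j + 2) * (j + 1) * j
    den = d * (d - 1) * (d - 2) * (d - 3)
    return 1 - Fraction(num, den)


def sum_bound(d: int) -> Fraction:
    """Upper bound (d-4) * (1 - 24 / (d(d-1)(d-2)(d-3)))."""
    return (d - 4) * (1 - Fraction(24, d * (d - 1) * (d - 2) * (d - 3)))


if __name__ == "__main__":
    n, d = 20, 10  # illustrative values, assumptions of this sketch
    total = sum(alpha_binomial(n, i, d) for i in range(n - d + 5, n + 1))
    print(total, sum_bound(d), total <= sum_bound(d))
```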



Proof of Theorem 5.1 Starting from Proposition 4.1, we just need to
upper bound the quantity



E 



IX? 
A combination of Lemma 6.6 and the proof of Theorem 6.2 in [9] shows that



E 



< c 



E 



l/(|Afhl) 



p(/:n = 0) 



(5.i: 



We obtain 



E 



E 

+ E 



1/(|M1-1) 



km 



l/(|Af|-l) 



7^„|(l-|/:^|/|7^„|) 

, X 1/(1^^1-1) 



L{|£a|<|7e„|/2} 



< E 



l{|a|>|7J„|/2}l{£„^0} 



1/(|M|-1)- 



+ E[A:y(|MI-i)i^l^^^,^l^^,/^^] 



Since $|M| \ge 4$, one has $2^{1/(|M|-1)} \le 2$ and
$k_n^{1/(|M|-1)} \le k_n$ in the rightmost term, so that, thanks to Lemma 6.2,



E 



l/(|A/hl) 



r 

'~'n 

< cJe 



|7^„| 



l/(|Af|-l)- 



+ E 



^ y E 



|7^„| : 



Or 



The theorem is a straightforward combination of Proposition 4.1, inequality
(5.1), and Lemma 6.1.

□ 



6 Technical lemmas 

Before stating some technical lemmas, we remind the reader that
$\mathcal{R}_n$ stands for the non-empty subset of $\{1, \dots, n\}$ of users
who have already rated the variable of interest at time $n$. Recall also
that, for all $i \ge 1$,

\[
T_i = \min\left(k \ge i : M_i^{k+1-i} \supset M\right)
\]

and

\[
\mathcal{L}_n = \{i \in \mathcal{R}_n : T_i \le n\}.
\]



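Lemma 6.1 We have

\[
\mathbb{P}(\mathcal{L}_n = \emptyset)
= \mathbb{E}\left[\prod_{i \in \mathcal{R}_n} \alpha_{ni}\right] \to 0
\quad \text{as } n \to \infty.
\]

Proof of Lemma 6.1 Conditionally on $M$ and $\mathcal{R}_n$, the random
variables $\{T_i, i \in \mathcal{R}_n\}$ are independent. Moreover, the
sequence $(M_i^n)_{n \ge 1}$ is nondecreasing. Thus, the identity
$[T_i > n] = [M_i^{n+1-i} \not\supset M]$ holds for all
$i \in \mathcal{R}_n$. Hence,

\[
\begin{aligned}
\mathbb{P}(\mathcal{L}_n = \emptyset)
&= \mathbb{P}\left(\forall i \in \mathcal{R}_n : T_i > n\right) \\
&= \mathbb{E}\left[\mathbb{P}\left(\forall i \in \mathcal{R}_n : T_i > n
\mid \mathcal{R}_n, M\right)\right] \\
&= \mathbb{E}\left[\prod_{i \in \mathcal{R}_n}
\mathbb{P}\left(T_i > n \mid \mathcal{R}_n, M\right)\right] \\
&= \mathbb{E}\left[\prod_{i \in \mathcal{R}_n} \alpha_{ni}\right]
\end{aligned}
\]

(by independence of $(M_i^{n+1-i}, M)$ and $\mathcal{R}_n$).

The last statement of the lemma is clear since, for all $i$,
$\alpha_{ni} \to 0$ a.s. as $n \to \infty$.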



□ 



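Lemma 6.2 We have

\[
\mathbb{E}\left[\frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}\right]
= \mathbb{E}\left[\frac{1}{|\mathcal{R}_n|}
\sum_{i \in \mathcal{R}_n} \alpha_{ni}\right]
\]

and

\[
\mathbb{E}\left[\frac{1}{|\mathcal{L}_n|}
\mathbf{1}_{\{\mathcal{L}_n \ne \emptyset\}}\right]
\le 2\,\mathbb{E}\left[\frac{1}{|\mathcal{R}_n|}\right]
+ 2\,\mathbb{E}\left[\frac{1}{|\mathcal{R}_n|}
\sum_{i \in \mathcal{R}_n} \alpha_{ni}\right].
\]

Moreover, if $\lim_{n \to \infty} |\mathcal{R}_n| = \infty$ a.s., then

\[
\lim_{n \to \infty}
\mathbb{E}\left[\frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}\right] = 0.
\]

Proof of Lemma 6.2 First, using the fact that the sequence
$(M_i^n)_{n \ge 1}$ is nondecreasing, we see that for all
$i \in \mathcal{R}_n$, $[T_i > n] = [M_i^{n+1-i} \not\supset M]$. Next,
recalling that $\mathcal{R}_n$ is independent of $T_i$ for fixed $i$, we
obtain

\[
\mathbb{E}\left[\frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}\right]
= \mathbb{E}\left[\frac{1}{|\mathcal{R}_n|} \sum_{i \in \mathcal{R}_n}
\mathbb{P}\left(M_1^{n+1-i} \not\supset M\right)\right],
\]

and this proves the first statement of the lemma. Now define
$\mathcal{J}_n = \{n + 1 - i, i \in \mathcal{R}_n\}$ and observe that

\[
\mathbb{E}\left[\frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}\right]
= \mathbb{E}\left[\frac{1}{|\mathcal{J}_n|} \sum_{j \in \mathcal{J}_n}
\mathbb{P}\left(M_1^{j} \not\supset M\right)\right],
\]

where we used $|\mathcal{J}_n| = |\mathcal{R}_n|$. Since, by assumption,
$|\mathcal{J}_n| = |\mathcal{R}_n| \to \infty$ a.s. as $n \to \infty$ and
$\mathbb{P}(M_1^j \not\supset M) \to 0$ as $j \to \infty$, we obtain

\[
\lim_{n \to \infty} \frac{1}{|\mathcal{J}_n|} \sum_{j \in \mathcal{J}_n}
\mathbb{P}\left(M_1^{j} \not\supset M\right) = 0 \quad \text{a.s.}
\]

The conclusion follows by applying Lebesgue's dominated convergence theorem.
The second statement of the lemma is obtained from the following chain of
inequalities:

\[
\begin{aligned}
\mathbb{E}\left[\frac{1}{|\mathcal{L}_n|}
\mathbf{1}_{\{\mathcal{L}_n \ne \emptyset\}}\right]
&= \mathbb{E}\left[\frac{1}{|\mathcal{R}_n|
\left(1 - |\mathcal{L}_n^c|/|\mathcal{R}_n|\right)}
\mathbf{1}_{\{\mathcal{L}_n \ne \emptyset\}}\right] \\
&= \mathbb{E}\left[\frac{1}{|\mathcal{R}_n|
\left(1 - |\mathcal{L}_n^c|/|\mathcal{R}_n|\right)}
\mathbf{1}_{\{|\mathcal{L}_n^c| \le |\mathcal{R}_n|/2\}}\right]
+ \mathbb{E}\left[\frac{1}{|\mathcal{L}_n|}
\mathbf{1}_{\{|\mathcal{L}_n^c| > |\mathcal{R}_n|/2\}}
\mathbf{1}_{\{\mathcal{L}_n \ne \emptyset\}}\right] \\
&\le 2\,\mathbb{E}\left[\frac{1}{|\mathcal{R}_n|}\right]
+ \mathbb{P}\left(|\mathcal{L}_n^c| > |\mathcal{R}_n|/2\right) \\
&\le 2\,\mathbb{E}\left[\frac{1}{|\mathcal{R}_n|}\right]
+ 2\,\mathbb{E}\left[\frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}\right].
\end{aligned}
\]

Applying the first part of the lemma completes the proof.

□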
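
The chain of inequalities in the proof of Lemma 6.2 rests on an elementary
bound that already holds pointwise: writing $l = |\mathcal{L}_n|$ and
$r = |\mathcal{R}_n|$, one has
$(1/l)\mathbf{1}_{\{l \ge 1\}} \le 2/r + 2(r-l)/r$ for all $0 \le l \le r$.
The following sketch (with hypothetical function names) checks this
exhaustively for small values with exact fractions; it illustrates the
argument and is not a substitute for the proof.

```python
from fractions import Fraction


def lhs(l: int) -> Fraction:
    """(1/|L_n|) * 1{L_n nonempty}, as a function of l = |L_n|."""
    return Fraction(1, l) if l > 0 else Fraction(0)


def rhs(l: int, r: int) -> Fraction:
    """2/|R_n| + 2 |L_n^c| / |R_n|, with |L_n^c| = r - l and r = |R_n|."""
    return Fraction(2, r) + 2 * Fraction(r - l, r)


def holds_everywhere(r_max: int) -> bool:
    """Check lhs <= rhs for every 1 <= r <= r_max and every 0 <= l <= r."""
    return all(
        lhs(l) <= rhs(l, r)
        for r in range(1, r_max + 1)
        for l in range(r + 1)
    )


if __name__ == "__main__":
    print(holds_everywhere(100))
```

Taking expectations on both sides of the pointwise bound gives the second
statement of the lemma.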
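
Lemma 6.3 Denote by $\mathbf{Z}^\star$ and $\mathbf{Z}_1^\star$ the random
variables $\mathbf{Z}^\star = \mathbf{X}^\star/\|\mathbf{X}^\star\|$ and
$\mathbf{Z}_1^\star = \mathbf{X}_1^\star/\|\mathbf{X}_1^\star\|$, and let
$\xi(\mathbf{Z}^\star) = \mathbb{P}(S(\mathbf{Z}^\star, \mathbf{Z}_1^\star) > 1/2 \mid \mathbf{Z}^\star)$.
Then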



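\[
\mathbb{P}\left(2k_n > |\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)
\mid \mathcal{L}_n, M\right)
\le 2\,\mathbb{E}\left[\frac{k_n}{|\mathcal{R}_n|}\right]
\mathbb{E}\left[\frac{1}{\xi(\mathbf{Z}^\star)} \,\Big|\, M\right]
+ \mathbb{E}\left[\frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}\right].
\]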



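Proof of Lemma 6.3 If $M$ is fixed, $\mathbf{Z}^\star$ is independent of
$\mathcal{L}_n$ and $\mathcal{R}_n$. Thus, by Markov's inequality,

\[
\begin{aligned}
\mathbb{P}\left(2k_n > |\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)
\mid \mathcal{L}_n, M, \mathcal{R}_n\right)
&= \mathbb{P}\left(2k_n > |\mathcal{R}_n|\,\xi(\mathbf{Z}^\star)
- |\mathcal{L}_n^c|\,\xi(\mathbf{Z}^\star)
\mid \mathcal{L}_n, M, \mathcal{R}_n\right) \\
&= \mathbb{P}\left(2k_n + |\mathcal{L}_n^c|\,\xi(\mathbf{Z}^\star)
> |\mathcal{R}_n|\,\xi(\mathbf{Z}^\star)
\mid \mathcal{L}_n, M, \mathcal{R}_n\right) \\
&\le \frac{2k_n}{|\mathcal{R}_n|}\,
\mathbb{E}\left[\frac{1}{\xi(\mathbf{Z}^\star)} \,\Big|\, M\right]
+ \frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}.
\end{aligned}
\]

The proof is completed by observing that $\mathcal{R}_n$ and $M$ are
independent random variables.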

□ 

Let $\mathcal{B}(\mathbf{x}, \varepsilon)$ be the closed Euclidean ball in
$\mathbb{R}^d$ centered at $\mathbf{x}$ of radius $\varepsilon$. Recall that
the support of a probability measure $\mu$ is defined as the closure of the
collection of all $\mathbf{x}$ with
$\mu(\mathcal{B}(\mathbf{x}, \varepsilon)) > 0$ for all $\varepsilon > 0$.
The next lemma can be proved with a slight modification of the proof of
Lemma 10.2 in Devroye et al.

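Lemma 6.4 Let $\mu$ be a probability measure on $\mathbb{R}^d$ with a
compact support. Then

\[
\int \frac{1}{\mu(\mathcal{B}(\mathbf{x}, r))}\,\mu(\mathrm{d}\mathbf{x}) \le C,
\]

with $C > 0$ a constant depending upon $d$ and $r$ only.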
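
Lemma 6.5 Suppose that $|M| \ge 2$, and let the event

\[
\mathcal{A}_n = \left[\exists\, i \in \mathcal{L}_n^c :
\mathbf{X}_i^{(n)} \text{ is among the } k_n\text{-MS of } \mathbf{X}^\star
\text{ in } \{\mathbf{X}_i^{(n)}, i \in \mathcal{R}_n\}\right].
\]

Then

\[
\mathbb{P}(\mathcal{A}_n) \le C \left\{
\mathbb{E}\left[\frac{k_n}{|\mathcal{R}_n|}\right]
+ \mathbb{E}\left[\frac{1}{|\mathcal{R}_n|}
\sum_{i \in \mathcal{R}_n} \alpha_{ni}\right]
+ \mathbb{E}\left[\prod_{i \in \mathcal{R}_n} \alpha_{ni}\right]
\right\}.
\]

Proof of Lemma 6.5 Recall that, for a fixed $i \in \mathcal{R}_n$, the
random variable
$\mathbf{X}_i^\star = (X_{i1}^\star, \dots, X_{id}^\star)$ is defined by

\[
X_{ij}^\star =
\begin{cases}
X_{ij} & \text{if } j \in M, \\
0 & \text{otherwise,}
\end{cases}
\]

and $\mathbf{X}_i^{(n)} = \mathbf{X}_i^\star$ as soon as
$M \subset M_i^{n+1-i}$.
We first prove the inclusion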

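\[
\mathcal{A}_n \subset \left[\,\left|\left\{j \in \mathcal{L}_n :
S(\mathbf{X}^\star, \mathbf{X}_j^\star) > 1/2\right\}\right| \le k_n\right].
\qquad (6.1)
\]

Take $i \in \mathcal{L}_n^c$ such that $\mathbf{X}_i^{(n)}$ is among the
$k_n$-MS of $\mathbf{X}^\star$ in
$\{\mathbf{X}_i^{(n)}, i \in \mathcal{R}_n\}$. Then, for all
$j \in \mathcal{L}_n$ such that
$S(\mathbf{X}^\star, \mathbf{X}_j^\star) > 1/2$, we have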



^(x^xp > - > J9^5(x^x^) = 5(x^xl 



2 

since pj"^ < 1 - 1/\M\ < 1/2 if |M| > 2. If 



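\[
\left|\left\{j \in \mathcal{L}_n :
S(\mathbf{X}^\star, \mathbf{X}_j^\star) > 1/2\right\}\right| > k_n,
\]

then $\mathbf{X}_i^{(n)}$ is not among the $k_n$-MS of $\mathbf{X}^\star$
among the $\{\mathbf{X}_i^{(n)}, i \in \mathcal{R}_n\}$. This contradicts the
assumption on $\mathbf{X}_i^{(n)}$ and proves inclusion (6.1).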

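Next, define $\mathbf{Z}^\star = \mathbf{X}^\star/\|\mathbf{X}^\star\|$,
$\mathbf{Z}_i^\star = \mathbf{X}_i^\star/\|\mathbf{X}_i^\star\|$,
$i = 1, \dots, n$, and let
$\xi(\mathbf{Z}^\star) = \mathbb{P}(S(\mathbf{Z}^\star, \mathbf{Z}_1^\star) > 1/2 \mid \mathbf{Z}^\star)$.
If $k_n - |\mathcal{L}_n|\,\xi(\mathbf{Z}^\star) \le -\frac{1}{2}|\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)$
and $\mathcal{L}_n \ne \emptyset$, we deduce from (6.1) that

\[
\begin{aligned}
\mathbb{P}\left(\mathcal{A}_n \mid \mathcal{L}_n, \mathbf{Z}^\star\right)
&\le \mathbb{P}\left(\sum_{j \in \mathcal{L}_n}
\mathbf{1}_{\{S(\mathbf{Z}^\star, \mathbf{Z}_j^\star) > 1/2\}} \le k_n
\,\Big|\, \mathcal{L}_n, \mathbf{Z}^\star\right) \\
&= \mathbb{P}\left(\sum_{j \in \mathcal{L}_n}
\left(\mathbf{1}_{\{S(\mathbf{Z}^\star, \mathbf{Z}_j^\star) > 1/2\}}
- \xi(\mathbf{Z}^\star)\right)
\le k_n - |\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)
\,\Big|\, \mathcal{L}_n, \mathbf{Z}^\star\right) \\
&\le \mathbb{P}\left(\sum_{j \in \mathcal{L}_n}
\left(\mathbf{1}_{\{S(\mathbf{Z}^\star, \mathbf{Z}_j^\star) > 1/2\}}
- \xi(\mathbf{Z}^\star)\right)
\le -\frac{1}{2}|\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)
\,\Big|\, \mathcal{L}_n, \mathbf{Z}^\star\right) \\
&\le \frac{4\,|\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)}
{\left(|\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)\right)^2}
= \frac{4}{|\mathcal{L}_n|\,\xi(\mathbf{Z}^\star)}
\end{aligned}
\]

(by Tchebychev's inequality).

In the last inequality, we use the fact that, since
$\sigma(M) \subset \sigma(\mathbf{Z}^\star)$, the random variables
$\{\mathbf{Z}_i^\star, i \in \mathcal{L}_n\}$ are independent conditionally
on $\mathbf{Z}^\star$ and $\mathcal{L}_n$. Using again the inclusion
$\sigma(M) \subset \sigma(\mathbf{Z}^\star)$, we obtain, on the event
$[\mathcal{L}_n \ne \emptyset]$,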



= E 


P 


4 


'.E 


< 






4 








1 

n 1 



^n,, Z^ 



e(z* 
1 



M 



Applying Lemma [673| on the event 7^ 0], 



4 ^ 

*-"n. 



,e(z* 



M 



+ 2E 



E 



M 



+ E 



Moreover, by fact I4.H 



^(Z*) = P S{Z\Z\) > 



> P d\Z\Z\) < 



Thus, denoting by $\nu^M$ the distribution of $\mathbf{Z}^\star$
conditionally on $M$, we deduce from Lemma 6.4 that



E 



,e(z^ 



M 



< 



z/'^(dz) <C, 



where the constant C does not depend on M. Putting all the pieces together, 
we obtain 



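\[
\mathbb{P}(\mathcal{A}_n) \le C \left\{
\mathbb{E}\left[\frac{1}{|\mathcal{L}_n|}
\mathbf{1}_{\{\mathcal{L}_n \ne \emptyset\}}\right]
+ \mathbb{E}\left[\frac{k_n}{|\mathcal{R}_n|}\right]
+ \mathbb{E}\left[\frac{|\mathcal{L}_n^c|}{|\mathcal{R}_n|}\right]
+ \mathbb{P}(\mathcal{L}_n = \emptyset)
\right\}.
\]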



We conclude the proof with Lemma 6.1 and Lemma 6.2.



□ 



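In the sequel, we let
$\mathbf{X}^\star_{(1)}, \dots, \mathbf{X}^\star_{(|\mathcal{L}_n|)}$ be the
sequence $\{\mathbf{X}_i^\star, i \in \mathcal{L}_n\}$ reordered according to
decreasing similarities $S(\mathbf{X}^\star, \mathbf{X}_i^\star)$,
$i \in \mathcal{L}_n$, that is,

\[
S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)}) \ge \dots \ge
S(\mathbf{X}^\star, \mathbf{X}^\star_{(|\mathcal{L}_n|)}).
\]

Lemma 6.6 below states the rate of convergence to 1 of
$S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})$.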
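
As an illustration of this reordering, the sketch below sorts a few rating
vectors by decreasing cosine similarity to a target vector, and checks the
elementary identity $1 - S(\mathbf{x}, \mathbf{y}) = \|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|^2/2$
for the normalized vectors $\bar{\mathbf{x}} = \mathbf{x}/\|\mathbf{x}\|$ and
$\bar{\mathbf{y}} = \mathbf{y}/\|\mathbf{y}\|$, which links the similarity to
the Euclidean distance on the unit sphere. The function names and the data
are hypothetical and chosen for the example only.

```python
import math


def norm(x):
    """Euclidean norm of a vector given as a list of numbers."""
    return math.sqrt(sum(a * a for a in x))


def cosine_similarity(x, y):
    """Cosine-type similarity S(x, y) = <x, y> / (||x|| ||y||)."""
    return sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))


def most_similar_order(x_star, neighbors):
    """Indices of `neighbors` sorted by decreasing similarity to x_star."""
    sims = [cosine_similarity(x_star, x) for x in neighbors]
    # sorted() is stable, so ties keep their original order.
    return sorted(range(len(neighbors)), key=lambda i: -sims[i])


def half_squared_distance(x, y):
    """||x/||x|| - y/||y|| ||^2 / 2, which equals 1 - S(x, y)."""
    nx, ny = norm(x), norm(y)
    return sum((a / nx - b / ny) ** 2 for a, b in zip(x, y)) / 2.0


if __name__ == "__main__":
    x_star = [1.0, 2.0, 2.0]
    users = [[2.0, 4.0, 4.0], [1.0, 0.0, 0.0], [2.0, 2.0, 1.0]]
    print(most_similar_order(x_star, users))
    for u in users:
        print(1.0 - cosine_similarity(x_star, u), half_squared_distance(x_star, u))
```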
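Lemma 6.6 Suppose that $|M| \ge 4$. Then there exists $C > 0$ such that, on
the event $[\mathcal{L}_n \ne \emptyset]$,

\[
1 - \mathbb{E}\left[S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
\mid M, \mathcal{L}_n\right]
\le \frac{C}{|\mathcal{L}_n|^{2/(|M|-1)}}.
\]

Proof of Lemma 6.6 Observe that

\[
\begin{aligned}
\mathbb{E}\left[1 - S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
\mid \mathbf{X}^\star, \mathcal{L}_n\right]
&= \int_0^1 \mathbb{P}\left(1 - S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
> \varepsilon \mid \mathbf{X}^\star, \mathcal{L}_n\right)
\mathrm{d}\varepsilon \\
&= \int_0^1 \mathbb{P}\left(\forall i \in \mathcal{L}_n :
1 - S(\mathbf{X}^\star, \mathbf{X}_i^\star) > \varepsilon
\mid \mathbf{X}^\star, \mathcal{L}_n\right) \mathrm{d}\varepsilon.
\end{aligned}
\]

Since $\sigma(M) \subset \sigma(\mathbf{X}^\star)$, given $\mathbf{X}^\star$
and $\mathcal{L}_n$, the random variables
$\{\mathbf{X}_i^\star, i \in \mathcal{L}_n\}$ are independent and identically
distributed. Hence,

\[
\mathbb{E}\left[1 - S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
\mid \mathbf{X}^\star, \mathcal{L}_n\right]
= \int_0^1 \left[\mathbb{P}\left(1 - S(\mathbf{X}^\star, \mathbf{X}_1^\star)
> \varepsilon \mid \mathbf{X}^\star\right)\right]^{|\mathcal{L}_n|}
\mathrm{d}\varepsilon.
\]

Denote by $\nu^M$ the conditional distribution of
$\mathbf{X}^\star/\|\mathbf{X}^\star\|$ given $M$. The support of $\nu^M$ is
contained in both the unit sphere of $\mathbb{R}^d$ and in a
$|M|$-dimensional vector space. Thus, for simplicity, we shall consider that
the support of $\nu^M$ is contained in the unit sphere of
$\mathbb{R}^{|M|}$. Let $\mathcal{B}^{|M|}(\mathbf{x}, r)$ be the closed
Euclidean ball in $\mathbb{R}^{|M|}$ centered at $\mathbf{x}$ of radius $r$.
Since $\mathbf{X}^\star$ (resp. $\mathbf{X}_1^\star$) only depends on $M$ and
$\mathbf{X}$ (resp. $\mathbf{X}_1$), then, given $\mathbf{X}^\star$, the
random variable $\mathbf{X}_1^\star/\|\mathbf{X}_1^\star\|$ is distributed
according to $\nu^M$. Thus, for any $\varepsilon > 0$, we may write
(Fact 4.1)

\[
\mathbb{P}\left(1 - S(\mathbf{X}^\star, \mathbf{X}_1^\star) > \varepsilon
\mid \mathbf{X}^\star\right)
= 1 - \nu^M\left(\mathcal{B}^{|M|}\left(
\frac{\mathbf{X}^\star}{\|\mathbf{X}^\star\|}, \sqrt{2\varepsilon}\right)\right)
\]

and, consequently,

\[
\mathbb{E}\left[1 - S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
\mid \mathbf{X}^\star, \mathcal{L}_n\right]
= \int_0^1 \left[1 - \nu^M\left(\mathcal{B}^{|M|}\left(
\frac{\mathbf{X}^\star}{\|\mathbf{X}^\star\|}, \sqrt{2\varepsilon}\right)\right)
\right]^{|\mathcal{L}_n|} \mathrm{d}\varepsilon.
\]

Using the inclusion $\sigma(M) \subset \sigma(\mathbf{X}^\star)$, we obtain

\[
\mathbb{E}\left[1 - S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
\mid M, \mathcal{L}_n\right]
= \int_0^1 \mathbb{E}\left[\left[1 - \nu^M\left(\mathcal{B}^{|M|}\left(
\frac{\mathbf{X}^\star}{\|\mathbf{X}^\star\|}, \sqrt{2\varepsilon}\right)\right)
\right]^{|\mathcal{L}_n|} \,\Big|\, M, \mathcal{L}_n\right]
\mathrm{d}\varepsilon.
\qquad (6.2)
\]

Fix $\varepsilon > 0$, and denote by $\mathcal{S}(M)$ the support of $\nu^M$.
There exist Euclidean balls $A_1, \dots, A_{N(\varepsilon)}$ in
$\mathbb{R}^{|M|}$ with radius $\sqrt{2\varepsilon}/2$ such that

\[
\mathcal{S}(M) \subset \bigcup_{j=1}^{N(\varepsilon)} A_j
\quad \text{and} \quad
N(\varepsilon) \le \frac{C}{\varepsilon^{(|M|-1)/2}},
\]

for some $C > 0$ which may be chosen independently of $M$. Clearly, if
$\mathbf{x} \in A_j \cap \mathcal{S}(M)$, then
$A_j \subset \mathcal{B}^{|M|}(\mathbf{x}, \sqrt{2\varepsilon})$. Thus,

\[
\begin{aligned}
\mathbb{E}\left[\left[1 - \nu^M\left(\mathcal{B}^{|M|}\left(
\frac{\mathbf{X}^\star}{\|\mathbf{X}^\star\|}, \sqrt{2\varepsilon}\right)\right)
\right]^{|\mathcal{L}_n|} \,\Big|\, M, \mathcal{L}_n\right]
&\le \sum_{j=1}^{N(\varepsilon)} \int_{A_j}
\left[1 - \nu^M\left(\mathcal{B}^{|M|}(\mathbf{x}, \sqrt{2\varepsilon})\right)
\right]^{|\mathcal{L}_n|} \nu^M(\mathrm{d}\mathbf{x}) \\
&\le \sum_{j=1}^{N(\varepsilon)} \int_{A_j}
\left(1 - \nu^M(A_j)\right)^{|\mathcal{L}_n|} \nu^M(\mathrm{d}\mathbf{x}) \\
&\le N(\varepsilon) \max_{t \in [0,1]} t(1-t)^{|\mathcal{L}_n|} \\
&\le \frac{C}{|\mathcal{L}_n|\,\varepsilon^{(|M|-1)/2}}.
\end{aligned}
\]

Combining this inequality and equality (6.2), we obtain

\[
\mathbb{E}\left[1 - S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
\mid M, \mathcal{L}_n\right]
\le \int_0^1 \min\left(1,
\frac{C}{|\mathcal{L}_n|\,\varepsilon^{(|M|-1)/2}}\right) \mathrm{d}\varepsilon.
\]

Since $|M| \ge 4$, an easy calculation shows that there exists $C > 0$ such
that

\[
\mathbb{E}\left[1 - S(\mathbf{X}^\star, \mathbf{X}^\star_{(1)})
\mid M, \mathcal{L}_n\right]
\le \frac{C}{|\mathcal{L}_n|^{2/(|M|-1)}},
\]

which leads to the desired result.

□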
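
The "easy calculation" closing the proof of Lemma 6.6 can be made explicit.
With $a = (|M|-1)/2 > 1$ and $m = |\mathcal{L}_n|$, the integrand
$\min(1, c/(m\varepsilon^{a}))$ equals 1 up to
$\varepsilon_0 = (c/m)^{1/a}$ and is integrable beyond, which gives an
integral of at most $\frac{a}{a-1}(c/m)^{1/a}$, that is, of order
$m^{-2/(|M|-1)}$. The sketch below, with hypothetical function names and an
arbitrary constant $c$, compares this closed form with a midpoint Riemann sum.

```python
def truncated_integral(c: float, m: int, a: float) -> float:
    """Closed form of the integral over [0, 1] of min(1, c / (m * eps**a)), a > 1."""
    eps0 = (c / m) ** (1.0 / a)  # point where c / (m * eps**a) crosses 1
    if eps0 >= 1.0:
        return 1.0
    tail = (c / m) * (eps0 ** (1.0 - a) - 1.0) / (a - 1.0)
    return eps0 + tail


def midpoint_integral(c: float, m: int, a: float, steps: int = 100000) -> float:
    """Midpoint Riemann sum of the same integrand, used as a numerical check."""
    h = 1.0 / steps
    total = 0.0
    for s in range(steps):
        eps = (s + 0.5) * h
        total += min(1.0, c / (m * eps ** a))
    return total * h


def rate_bound(c: float, m: int, a: float) -> float:
    """Upper bound (a / (a - 1)) * (c / m)**(1 / a) on the integral."""
    return a / (a - 1.0) * (c / m) ** (1.0 / a)


if __name__ == "__main__":
    size_M = 4  # smallest case allowed by the lemma
    a = (size_M - 1) / 2.0
    for m in (10, 100, 1000):
        print(m, truncated_integral(1.0, m, a), rate_bound(1.0, m, a))
```

The condition $|M| \ge 4$ is what guarantees $a > 1$, so that the tail of the
integral is controlled by its lower endpoint.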